Abstract

In this paper, we propose a novel, effective and efficient probabilistic pruning criterion for probabilistic
similarity queries on uncertain data. Our approach supports a general uncertainty model using continuous
probability density functions (PDFs) to describe the (possibly correlated) uncertain attributes of objects.
In a nutshell, the problem to be solved is to compute the PDF of the random variable denoted by the
probabilistic domination count: given an uncertain database object B, an uncertain reference object R and a
set $\mathcal{D}$ of uncertain database objects in a multi-dimensional space, the probabilistic domination
count denotes the number of uncertain objects in $\mathcal{D}$ that are closer to R than B. This domination
count can be used to answer a wide range of probabilistic similarity queries. Specifically, we propose a
novel geometric pruning filter and introduce an iterative filter-refinement strategy for conservatively and
progressively estimating the probabilistic domination count in an efficient way while keeping correctness
according to the possible world semantics. In an experimental evaluation, we show that our proposed
technique quickly yields tight probability bounds for the probabilistic domination count, even for large
uncertain databases.

I. Introduction 

In the past two decades, there has been a great deal of interest in developing efficient and effective
methods for similarity queries, e.g. k-nearest neighbor search, reverse k-nearest neighbor search and
ranking in spatial, temporal, multimedia and sensor databases. Many applications dealing with such data
have to cope with uncertain or imprecise data.

In this work, we introduce a novel scalable pruning approach to identify candidates for a class of
probabilistic similarity queries. Generally speaking, probabilistic similarity queries compute for each
database object $o \in \mathcal{D}$ the probability that a given query predicate is fulfilled. Our approach
addresses probabilistic similarity queries where the query predicate is based on object (distance)
relations, i.e. the event that an object B belongs to the result set depends on the relation of its
distance to the query object R and the distance of another object A to the query object. As examples, we
apply our novel pruning method to the most prominent queries of the above-mentioned class, including the
probabilistic k-nearest neighbor (PkNN) query, the probabilistic reverse k-nearest neighbor (PRkNN) query
and the probabilistic inverse ranking query.

A. Uncertainty Model 

In this paper, we assume that the database $\mathcal{D}$ consists of multi-attribute objects
$o_1, \ldots, o_N$ that may have uncertain attribute values. An uncertain attribute is defined as follows:

Definition 1 (Probabilistic Attribute). A probabilistic attribute attr of object $o_i$ is a random variable
drawn from a probability distribution with density function $f_i^{attr}$.

Fig. 1. A dominates B w.r.t. R with high probability.

An uncertain object has at least one uncertain attribute value. The function $f_i$ denotes the
multi-dimensional probability density function (PDF) of $o_i$ that combines the density functions of all
probabilistic attributes attr of $o_i$.

Following the convention of uncertain databases [6], [H, [El, [El, ^B, ^M, we assume that $f_i$
is (minimally) bounded by an uncertainty region $R_i$ such that $\forall x \notin R_i : f_i(x) = 0$ and

$$\int_{R_i} f_i(x)\,dx \le 1.$$

Specifically, the case $\int_{R_i} f_i(x)\,dx < 1$ implements existential uncertainty, i.e. object $o_i$ may
not exist in the database at all with a probability greater than zero. In this paper we focus on the case
$\int_{R_i} f_i(x)\,dx = 1$, but the proposed concepts can be easily adapted to existentially uncertain
objects. Although our approach is also applicable to unbounded PDFs, e.g. Gaussian PDFs, here we assume
that $f_i$ exceeds zero only within a bounded region. This is a realistic assumption because the spectrum
of possible values of attributes is usually bounded, and it is commonly used in related work, e.g. [8],
[[3 and [6]. Even if $f_i$ is given as an unbounded PDF, a common strategy is to truncate PDF tails with
negligible probabilities and normalize the resulting PDF. Specifically, [6] shows that for a reasonably low
truncation threshold, the impact on the accuracy of probabilistic ranking queries is quite low while the
impact on the query performance is very high. In this way, each uncertain object can be considered as a
d-dimensional rectangle with an associated multi-dimensional object PDF (cf. Figure 1). Here, we assume
that uncertain attributes may be mutually dependent. Therefore, the object PDF can have an arbitrary form
and, in general, cannot simply be derived from the marginal distributions of the uncertain attributes. Note
that in many applications, a discrete uncertainty model is appropriate, meaning that the probability
distribution of an uncertain object is given by a finite number of alternatives, each assigned a
probability. This can be seen as a special case of our model.
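
To make the discrete special case concrete, the following minimal sketch represents an uncertain object by a finite list of weighted alternatives and derives its bounding rectangle. The class name `UncertainObject` and its methods are hypothetical and serve only as an illustration; they are not taken from the paper. A total probability below 1 corresponds to existential uncertainty.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, ...]


@dataclass
class UncertainObject:
    """Hypothetical container for a discrete uncertain object (illustration only)."""
    alternatives: List[Tuple[Point, float]]  # (position, probability) pairs

    def __post_init__(self) -> None:
        total = self.existence_probability()
        if any(p < 0 for _, p in self.alternatives) or total > 1.0 + 1e-9:
            raise ValueError("probabilities must be non-negative and sum to at most 1")

    def existence_probability(self) -> float:
        # A sum below 1 corresponds to existential uncertainty.
        return sum(p for _, p in self.alternatives)

    def mbr(self) -> List[Tuple[float, float]]:
        # Minimum bounding rectangle: one (low, high) interval per dimension.
        points = [pt for pt, _ in self.alternatives]
        return [(min(c), max(c)) for c in zip(*points)]


obj = UncertainObject([((0.0, 1.0), 0.5), ((2.0, 3.0), 0.3), ((1.0, 0.0), 0.2)])
print(obj.mbr())  # [(0.0, 2.0), (0.0, 3.0)]
```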

B. Problem Formulation 

We address the problem of detecting for a given uncertain object B the number of uncertain objects
of an uncertain database $\mathcal{D}$ that are closer to (i.e. dominate) a reference object R than B. We
call this number the domination count of B w.r.t. R as defined below:

Definition 2 (Domination). Consider an uncertain database $\mathcal{D} = \{o_1, \ldots, o_N\}$ and an
uncertain reference object R. Let $A, B \in \mathcal{D}$. Dom(A, B, R) is the random indicator variable that
is 1 iff A dominates B w.r.t. R, formally:

$$Dom(A, B, R) = \begin{cases} 1, & \text{if } dist(a, r) < dist(b, r) \\ 0, & \text{otherwise} \end{cases}$$

where $a \in A$, $b \in B$ and $r \in R$ are samples drawn from the PDFs of A, B and R, respectively, and
dist is a distance function on vector objects.¹

Definition 3 (Domination Count). Consider an uncertain database $\mathcal{D} = \{o_1, \ldots, o_N\}$ and an
uncertain reference object R. For each uncertain object $B \in \mathcal{D}$, let DomCount(B, R) be the
random variable of the number of uncertain objects $A \in \mathcal{D}$ ($A \neq B$) that are closer to R
than B:

$$DomCount(B, R) = \sum_{A \in \mathcal{D},\, A \neq B} Dom(A, B, R)$$

DomCount(B, R) is the sum of $N - 1$ not necessarily identically distributed and not necessarily
independent Bernoulli variables. The problem solved in this paper is to efficiently compute the probability
density distribution of DomCount(B, R) ($B \in \mathcal{D}$), formally introduced by means of the
probabilistic domination (cf. Section III) and the probabilistic domination count (cf. Section IV).

Determining domination is a central module for most types of similarity queries in order to identify
true hits and true drops (pruning). In the context of probabilistic similarity queries, knowledge about the
PDF of DomCount(B, R) can be used to find out whether B satisfies the query predicate. For example, for
a probabilistic 5NN query with probability threshold $\tau = 10\%$ and query object Q, an object B can be
pruned (returned as a true hit) if the probability P(DomCount(B, Q) < 5) is less (more) than 10%.
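
The decision rule of this example can be sketched as follows for the case where only a lower and an upper bound of P(DomCount(B, Q) < k) are known. The function name `classify` and its return labels are hypothetical; treating an object as undecided while the threshold lies between the two bounds is an assumption of this sketch.

```python
def classify(p_lower: float, p_upper: float, tau: float) -> str:
    """Decide the fate of an object from bounds on P(DomCount(B, Q) < k)."""
    if not 0.0 <= p_lower <= p_upper <= 1.0:
        raise ValueError("expected 0 <= p_lower <= p_upper <= 1")
    if p_lower > tau:
        return "true hit"    # the probability certainly exceeds the threshold
    if p_upper < tau:
        return "true drop"   # the probability certainly stays below the threshold
    return "candidate"       # the bounds are not yet tight enough to decide


print(classify(0.20, 0.40, tau=0.10))  # true hit
print(classify(0.00, 0.05, tau=0.10))  # true drop
print(classify(0.05, 0.30, tau=0.10))  # candidate
```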

C. Overview 

Given an uncertain database $\mathcal{D} = \{o_1, \ldots, o_N\}$ and an uncertain reference object R, our
objective is to efficiently derive the distribution of DomCount(B, R) for any uncertain object
$B \in \mathcal{D}$ and use it in the computation of probabilistic similarity queries. First (Section III),
we build on the methodology of [15] to efficiently find the complete set of objects in $\mathcal{D}$ that
definitely dominate (are dominated by) B w.r.t. R. At the same time, we find the set of objects whose
dominance relationship to B is uncertain. Using a decomposition technique, for each object A in this set,
we can derive a lower and an upper bound for PDom(A, B, R), i.e. the probability that A dominates B w.r.t.
R. In Section IV, we show that due to dependencies between object distances to R, these probabilities
cannot be combined in a straightforward manner to approximate the distribution of DomCount(B, R). We
propose a solution that copes with these dependencies and introduce techniques that help to compute the
probabilistic domination count in an efficient way. In particular, we prove that the bounds of
PDom(A, B, R) are mutually independent if they are computed without a decomposition of B and R. Then, we
provide a class of uncertain generating functions that use these bounds to build the distribution of
DomCount(B, R). We then propose an algorithm which progressively refines DomCount(B, R) by iteratively
decomposing the objects that influence its computation (Section V). Section VI shows how to apply this
iterative probabilistic domination count refinement process to evaluate several types of probabilistic
similarity queries. In Section VII, we experimentally demonstrate the effectiveness and efficiency of our
probabilistic pruning methods for various parameter settings on artificial and real-world datasets.

II. Related Work 

The management of uncertain data has gained increasing interest in diverse application fields, e.g. sensor
monitoring [12], traffic analysis, location-based services [27] etc. Thus, modelling probabilistic databases
has become very important in the literature, e.g. [1], [23], [24]. In general, these models can be
classified into two types: discrete and continuous uncertainty models. Discrete models represent each
uncertain object by a discrete set of alternative values, each associated with a probability. This model is
in general adopted for probabilistic databases, where tuples are associated with existential probabilities,
e.g. [14], [19], [25], ttH.

¹ We assume Euclidean distance for the remainder of the paper, but the techniques can be applied to any Lp norm.

In this work, we concentrate on the continuous model in which an uncertain object is represented by
a probability density function (PDF) within the vector space. In general, similarity search methods based
on this model involve expensive integrations of the PDFs, hence special approximation and indexing
techniques for efficient query processing are typically employed [13], [26].

Uncertain similarity query processing has focused on various aspects. A lot of existing work dealing
with uncertain data addresses probabilistic nearest neighbor (NN) queries for certain query objects [[TT|.
[fTSll and for uncertain queries [fTTll . To reduce computational effort, [9] adds threshold constraints in
order to retrieve only objects whose probability of being the nearest neighbor exceeds a user-specified
threshold that controls the confidence required in a query answer. Similar query semantics in probabilistic
databases are provided by top-k nearest neighbor queries [6], where the k most probable results of being
the nearest neighbor to a certain query point are returned. Existing solutions for probabilistic k-nearest
neighbor (kNN) queries are restricted to expected distances of the uncertain objects to the query object
[22] or also use a threshold constraint QO]. However, the use of expected distances does not adhere to the
possible world semantics and may thus produce very inaccurate results that have only a very small
probability of being an actual result ([25], [19]). Several approaches return the full result to queries as
a ranking of probabilistic objects according to their distance to a certain query point H, [14], [19],
[25]. However, all these prior works have in common that the query is given as a single (certain) point. To
the best of our knowledge, k-nearest neighbor queries as well as ranking queries on uncertain data, where
the query object is allowed to be uncertain, have not been addressed so far. Probabilistic reverse nearest
neighbor (RNN) queries have been addressed in tTJ for data based on discrete and continuous uncertainty
models. Similar to our solution, the uncertainty regions of the data are modelled by MBRs. Based on these
approximations, the authors of [VJ are able to apply a combination of spatial, metric and probabilistic
pruning criteria to efficiently answer queries.

All of the above approaches that use MBRs as approximations for uncertain objects utilize the
minimum/maximum distance approximations in order to remove possible candidates. However, the pruning
power can be improved using geometry-based pruning techniques as shown in [15]. In this context, [20]
introduces a geometric pruning technique that can be utilized to answer monochromatic and bichromatic
probabilistic RNN queries for arbitrary object distributions.

The framework that we introduce in this paper can be used to answer probabilistic (threshold) kNN
queries and probabilistic reverse (threshold) kNN queries as well as probabilistic ranking and inverse
ranking queries for uncertain query objects.

III. Similarity Domination on Uncertain Data 

In this section, we tackle the following problem: given three uncertain objects A, B and R in a
multidimensional space $\mathbb{R}^d$, determine whether object A is closer to R than B w.r.t. a distance
function defined on the objects in $\mathbb{R}^d$. If this is the case, we say A dominates B w.r.t. R. In
contrast to [15], where this problem is solved for certain data, in the context of uncertain objects this
domination relation is not a predicate that is either true or false, but rather a (dichotomous) random
variable as defined in Definition 2. In the example depicted in Figure 1, there are three uncertain objects
A, B and R, each bounded by a rectangle representing the possible locations of the object in
$\mathbb{R}^d$. The PDFs of A, B and R are depicted as well. In this scenario, we cannot determine for sure
whether object A dominates B w.r.t. R. However, it is possible to determine that object A dominates object
B w.r.t. R with a high probability. The problem at issue is to determine the probabilistic domination,
defined as follows:

Definition 4 (Probabilistic Domination). Given three uncertain objects A, B and R, the probabilistic
domination PDom(A, B, R) denotes the probability that A dominates B w.r.t. R.
Naively, we can compute PDom(A, B, R) by simply integrating the probability of all possible worlds
in which A dominates B w.r.t. R, exploiting inter-object independence:

$$PDom(A, B, R) = \int_{a \in A} \int_{b \in B} \int_{r \in R} \delta(a, b, r) \cdot P(A = a) \cdot P(B = b) \cdot P(R = r) \, da \, db \, dr,$$

where $\delta(a, b, r)$ is the following indicator function:

$$\delta(a, b, r) = \begin{cases} 1, & \text{if } dist(a, r) < dist(b, r) \\ 0, & \text{else} \end{cases}$$

The problem of this naive approach is the computational cost of the triple integral. The integrals of the
PDFs of A, B and R may in general not be representable as closed-form expressions, and the integral
of $\delta(a, b, r)$ does not have a closed-form expression. Therefore, an expensive numeric approximation
is required for this approach. In the rest of this section we propose methods that efficiently derive
bounds for PDom(A, B, R), which can be used to prune objects while avoiding integral computations.
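
To illustrate what such a numeric approximation involves, the following sketch estimates PDom(A, B, R) by Monte Carlo sampling of possible worlds. It is not the method proposed in this paper. The helper names `uniform_box` and `estimate_pdom`, the uniform PDFs and the sample size are assumptions made for the illustration.

```python
import math
import random
from typing import Callable, Sequence

Sampler = Callable[[random.Random], Sequence[float]]


def uniform_box(lows: Sequence[float], highs: Sequence[float]) -> Sampler:
    """Return a sampler for a uniform PDF over an axis-parallel rectangle."""
    def sample(rng: random.Random) -> Sequence[float]:
        return [rng.uniform(lo, hi) for lo, hi in zip(lows, highs)]
    return sample


def estimate_pdom(a: Sampler, b: Sampler, r: Sampler,
                  n: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of PDom(A, B, R) under inter-object independence."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        pa, pb, pr = a(rng), b(rng), r(rng)        # one sampled possible world
        if math.dist(pa, pr) < math.dist(pb, pr):  # indicator delta(a, b, r)
            hits += 1
    return hits / n


near = uniform_box([0.0, 0.0], [1.0, 1.0])
far = uniform_box([10.0, 10.0], [11.0, 11.0])
print(estimate_pdom(near, far, near))  # 1.0: every sampled world has a closer to r
```

The accuracy of such an estimate improves only slowly with the number of samples, which motivates the bounds derived in the remainder of this section.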

A. Complete Domination 

First, we show how to detect whether A completely dominates B w.r.t. R (i.e. whether PDom(A, B, R) = 1)
regardless of the probability distributions assigned to the rectangular uncertainty regions. The
state-of-the-art criterion to detect spatial domination on rectangular uncertainty regions uses
minimum/maximum distance approximations. This criterion states that A dominates B w.r.t. R if the
minimum distance between R and B is greater than the maximum distance between R and A. Although
correct, this criterion is not tight (cf. [15]), i.e. not every case in which A dominates B w.r.t. R is
detected by the min/max-domination criterion. The problem is that the dependency between the distance
between A and R and the distance between B and R is ignored. Both distances depend on the location of R,
and since R can only have a unique location within its uncertainty region, the two distances are mutually
dependent. Therefore, we adopt the spatial domination concepts proposed in [15] for rectangular
uncertainty regions.

Corollary 1 (Complete Domination). Let A, B, R be uncertain objects with rectangular uncertainty
regions. Then the following statement holds:

$$PDom(A, B, R) = 1 \Leftrightarrow \sum_{i=1}^{d} \max_{r_i \in \{R_i^{min}, R_i^{max}\}} \left( MaxDist(A_i, r_i)^p - MinDist(B_i, r_i)^p \right) < 0,$$

where $A_i$, $B_i$ and $R_i$ denote the projection intervals of the respective rectangular uncertainty
regions of A, B and R on the i-th dimension; $R_i^{min}$ ($R_i^{max}$) denotes the lower (upper) bound of
interval $R_i$, and p corresponds to the used $L_p$ norm. The functions MaxDist(A, r) and MinDist(A, r)
denote the maximal (respectively minimal) distance between the one-dimensional interval A and the
one-dimensional point r.

Corollary 1 follows directly from [15]: the inequality is true if and only if for all points
$a \in A$, $b \in B$, $r \in R$, a is closer to r than b. Translated into the possible worlds model, this is
equivalent to the statement that A is closer to R than B in any possible world, which in turn means that
PDom(A, B, R) = 1.
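
The criterion of Corollary 1 can be sketched in a few lines, assuming that rectangles are given as one (low, high) interval per dimension. The function and type names below are hypothetical and only illustrate the inequality; they do not reproduce the implementation of [15].

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]
Rect = Sequence[Interval]  # one (low, high) interval per dimension


def max_dist(iv: Interval, r: float) -> float:
    """Maximal distance between a 1-D interval and a 1-D point."""
    return max(abs(r - iv[0]), abs(r - iv[1]))


def min_dist(iv: Interval, r: float) -> float:
    """Minimal distance between a 1-D interval and a 1-D point."""
    if iv[0] <= r <= iv[1]:
        return 0.0
    return min(abs(r - iv[0]), abs(r - iv[1]))


def completely_dominates(a: Rect, b: Rect, r: Rect, p: int = 2) -> bool:
    """True iff the sum over all dimensions in Corollary 1 is negative."""
    total = 0.0
    for a_i, b_i, r_i in zip(a, b, r):
        # Only the two interval bounds of R_i have to be inspected.
        total += max(max_dist(a_i, c) ** p - min_dist(b_i, c) ** p for c in r_i)
    return total < 0.0


A = [(0.0, 1.0), (0.0, 1.0)]
B = [(10.0, 11.0), (0.0, 1.0)]
R = [(1.0, 2.0), (0.0, 1.0)]
print(completely_dominates(A, B, R))  # True
print(completely_dominates(B, A, R))  # False
```

For the one-dimensional rectangles A = [10, 11], B = [20, 21] and R = [0, 12], this sketch reports domination, whereas the min/max criterion does not, since the minimum distance between R and B (8) is smaller than the maximum distance between R and A (11).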

In addition, it holds that 

Corollary 2.

$$PDom(A, B, R) = 1 \Leftrightarrow PDom(B, A, R) = 0$$

In the example depicted in Figure 2(a), the grey region on the right shows all points that are definitely
closer to A than to B, and the grey region on the left shows all points that are definitely closer to B
than to A. Consequently, A dominates B (B dominates A) if R completely falls into the right (left)
grey-shaded half-space.²



² Note that the grey regions are not explicitly computed; we only include them in Figure 2(a) for illustration purposes.

Fig. 2. Similarity domination: (a) complete domination, (b) probabilistic domination.



B. Probabilistic Domination 

Now, we consider the case where A does not completely dominate B w.r.t. R. In consideration of the
possible world semantics, there may exist worlds in which A dominates B w.r.t. R, but not all possible
worlds satisfy this criterion. Let us consider the example shown in Figure 2(b), where the uncertainty
region of A is decomposed into five partitions, each assigned to one of the five grey-shaded regions
illustrating which points are closer to the respective partition of A than to B. As we can see, R only
completely falls into three grey-shaded regions. This means that A does not completely dominate B w.r.t.
R. However, we know that in some possible worlds (at least in all possible worlds where A is located in
$A_1$, $A_2$ or $A_3$) A does dominate B w.r.t. R. The question at issue is how to determine the
probability PDom(A, B, R) that A dominates B w.r.t. R in an efficient way. The key idea is to decompose the
uncertainty region of an object X into subregions for which we know the probability that X is located in
that subregion (as done for object A in our example). Thus, if neither Dom(A, B, R) nor Dom(B, A, R) holds,
there may still exist subregions $A' \subset A$, $B' \subset B$ and $R' \subset R$ such that A' dominates
B' w.r.t. R'. Given disjoint decomposition schemes $\mathcal{A}$, $\mathcal{B}$ and $\mathcal{R}$, we can
identify triples of subregions ($A' \in \mathcal{A}$, $B' \in \mathcal{B}$, $R' \in \mathcal{R}$) for which
Dom(A', B', R') holds. Let $\delta(A', B', R')$ be the following indicator function:

$$\delta(A', B', R') = \begin{cases} 1, & \text{if } Dom(A', B', R') \\ 0, & \text{else} \end{cases}$$



Lemma 1. Let A, B and R be uncertain objects with disjoint object decompositions $\mathcal{A}$,
$\mathcal{B}$ and $\mathcal{R}$, respectively. To derive a lower bound $PDom_{LB}(A, B, R)$ of the
probability PDom(A, B, R) that A dominates B w.r.t. R, we can accumulate the probabilities of combinations
of these subregions as follows:

$$PDom_{LB}(A, B, R) = \sum_{A' \in \mathcal{A}, B' \in \mathcal{B}, R' \in \mathcal{R}} P(a \in A') \cdot P(b \in B') \cdot P(r \in R') \cdot \delta(A', B', R') \qquad (1)$$

where $P(x \in X')$ denotes the probability that object X is located within the region X'.

Proof: The probability of a combination (A', B', R') can be computed by
$P(a \in A') \cdot P(b \in B') \cdot P(r \in R')$ due to the assumption of mutually independent objects.
These probabilities can be aggregated due to the assumption of disjoint subregions, which implies that any
two different combinations of subregions ($A' \in \mathcal{A}$, $B' \in \mathcal{B}$, $R' \in \mathcal{R}$)
and ($A'' \in \mathcal{A}$, $B'' \in \mathcal{B}$, $R'' \in \mathcal{R}$) with
$A' \neq A'' \lor B' \neq B'' \lor R' \neq R''$ must represent disjoint sets of possible worlds. In all
possible worlds defined by combinations (A', B', R') with $\delta(A', B', R') = 1$, A dominates B w.r.t. R.
However, not all possible worlds in which A dominates B w.r.t. R are covered by these combinations, and
the uncovered worlds do not contribute to $PDom_{LB}(A, B, R)$. Consequently, $PDom_{LB}(A, B, R)$ is a
lower bound of PDom(A, B, R). ■

Fig. 3. $A_1$ and $A_2$ each dominate B w.r.t. R with a probability of 50%.



Analogously, we can define an upper bound of PDom(A, B, R):

Lemma 2. An upper bound $PDom_{UB}(A, B, R)$ of PDom(A, B, R) can be derived as follows:

$$PDom_{UB}(A, B, R) = 1 - PDom_{LB}(B, A, R)$$
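
The two lemmas can be sketched as follows, assuming that each decomposition is a list of (subregion, probability) pairs with disjoint rectangular subregions and that complete domination of subregions is tested with the criterion of Corollary 1. All names are hypothetical, and the example data are chosen for illustration only.

```python
from itertools import product
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]
Rect = Sequence[Interval]
Decomposition = List[Tuple[Rect, float]]  # (subregion, probability of lying in it)


def completely_dominates(a: Rect, b: Rect, r: Rect, p: int = 2) -> bool:
    """Criterion of Corollary 1 on rectangles (one interval per dimension)."""
    total = 0.0
    for (alo, ahi), (blo, bhi), r_i in zip(a, b, r):
        worst = []
        for c in r_i:  # lower and upper bound of R_i
            mx = max(abs(c - alo), abs(c - ahi))
            mn = 0.0 if blo <= c <= bhi else min(abs(c - blo), abs(c - bhi))
            worst.append(mx ** p - mn ** p)
        total += max(worst)
    return total < 0.0


def pdom_lower(dec_a: Decomposition, dec_b: Decomposition, dec_r: Decomposition) -> float:
    """Lemma 1: sum the probabilities of all subregion triples with delta = 1."""
    return sum(pa * pb * pr
               for (ra, pa), (rb, pb), (rr, pr) in product(dec_a, dec_b, dec_r)
               if completely_dominates(ra, rb, rr))


def pdom_upper(dec_a: Decomposition, dec_b: Decomposition, dec_r: Decomposition) -> float:
    """Lemma 2: one minus the lower bound with the roles of A and B swapped."""
    return 1.0 - pdom_lower(dec_b, dec_a, dec_r)


# 1-D example: R and B are points at 0 and 5, A is uniform on [0, 8].
dec_r = [([(0.0, 0.0)], 1.0)]
dec_b = [([(5.0, 5.0)], 1.0)]
coarse_a = [([(0.0, 8.0)], 1.0)]
fine_a = [([(0.0, 4.0)], 0.5), ([(4.0, 6.0)], 0.25), ([(6.0, 8.0)], 0.25)]

print(pdom_lower(coarse_a, dec_b, dec_r), pdom_upper(coarse_a, dec_b, dec_r))  # 0.0 1.0
print(pdom_lower(fine_a, dec_b, dec_r), pdom_upper(fine_a, dec_b, dec_r))      # 0.5 0.75
```

In this example the exact value is 5/8 = 0.625 (the probability that a uniform point on [0, 8] is closer to 0 than the point 5), which lies within both pairs of bounds.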

Naturally, the more refined the decompositions are, the tighter the bounds that can be computed and the
higher the corresponding cost of deriving them. In particular, starting from the entire MBRs of the objects,
we can progressively partition them to iteratively derive tighter bounds for their dependency relationships
until a desired degree of certainty is achieved (based on some threshold). However, in the next section, we
show that the domination count DomCount(B, R) of a given object B (cf. Definition 3), which is the main
module of prominent probabilistic queries, cannot be straightforwardly derived with the use of these
bounds, and we propose a methodology based on generating functions for this purpose.



IV. Probabilistic Domination Count 



In Section III we described how to conservatively and progressively approximate the probability that
A dominates B w.r.t. R. Given these approximations $PDom_{LB}(A, B, R)$ and $PDom_{UB}(A, B, R)$, the next
problem is to accumulate these probabilities to get an approximation of the domination count
DomCount(B, R) of an object B w.r.t. R (cf. Definition 3). To give an intuition of how challenging this
problem is, we first present, in Section IV-A, a naive solution that can yield incorrect results because it
ignores dependencies between domination relations. To avoid the problem of dependent domination relations,
we first show in Section IV-B how to exploit object independencies to derive domination bounds that are
mutually independent. Afterwards, in Section IV-C, we introduce a new class of uncertain generating
functions that can be used to derive bounds for the domination count efficiently, as we show in Section
IV-D. Finally, in Section IV-E, we show how to improve our domination count approximation by considering
disjoint subsets of possible worlds for which a more accurate approximation can be computed.



A. The Problem of Domination Dependencies 

To compute DomCount(B, R), a straightforward solution is to first approximate PDom(A, B, R) for
all $A \in \mathcal{D}$ using the technique proposed in Section III. Then, given these probabilities, we can
apply the technique of uncertain generating functions (cf. Section IV-C) to approximate the probability
that exactly 0, exactly 1, ..., exactly $N - 1$ uncertain objects dominate B. However, this approach ignores
possible dependencies between domination relationships. Although we assume independence between
objects, the random variables $Dom(A_1, B, R)$ and $Dom(A_2, B, R)$ are mutually dependent, because the
distance between $A_1$ and R depends on the distance between $A_2$ and R: object R can only appear
once. Consider the following example:

Example 1. Consider a database of three certain objects B, $A_1$ and $A_2$ and the uncertain reference
object R, as shown in Figure 3. For simplicity, objects $A_1$ and $A_2$ have the same position in this
example. The task is to determine the domination count of B w.r.t. R. The domination half-space for $A_1$
and $A_2$ is depicted here as well. Let us assume that $A_1$ ($A_2$) dominates B with a probability of
$PDom(A_1, B, R) = PDom(A_2, B, R) = 50\%$. Recall that this probability can be computed by integration or
approximated with arbitrary precision using the technique of Section III. However, in this example, the
probability that both $A_1$ and $A_2$ dominate B is not simply 50% · 50% = 25%, as the generating function
technique would return.

The reason for the wrong result in this example is that the generating function requires mutually
independent random variables. However, in this example, R falls into the domination half-space of $A_1$
if and only if it also falls into the domination half-space of $A_2$. Thus we have the dependency
$Dom(A_1, B, R) \Leftrightarrow Dom(A_2, B, R)$, and the probability for B to be dominated by both $A_1$
and $A_2$ is

$$P(Dom(A_1, B, R)) \cdot P(Dom(A_2, B, R) \mid Dom(A_1, B, R)) = 0.5 \cdot 1 = 0.5.$$
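
The effect can be reproduced by enumerating the possible worlds of a small one-dimensional instance in the spirit of Example 1. The concrete positions and the two equally likely locations of R are assumptions made for this sketch; the function names are hypothetical.

```python
from itertools import product

# Each object is a list of (position, probability) alternatives.
A1 = [(0.0, 1.0)]               # certain object
A2 = [(0.0, 1.0)]               # certain object at the same position as A1
B = [(10.0, 1.0)]               # certain object
R = [(2.0, 0.5), (8.0, 0.5)]    # uncertain reference object


def dom(a: float, b: float, r: float) -> bool:
    """a dominates b w.r.t. r in one possible world."""
    return abs(a - r) < abs(b - r)


def pdom(A, B, R) -> float:
    """Exact PDom(A, B, R) by enumeration of all possible worlds."""
    return sum(pa * pb * pr
               for (a, pa), (b, pb), (r, pr) in product(A, B, R)
               if dom(a, b, r))


def joint_pdom(A1, A2, B, R) -> float:
    """Probability that both A1 and A2 dominate B w.r.t. the same sample of R."""
    total = 0.0
    for (a1, p1), (a2, p2), (b, pb), (r, pr) in product(A1, A2, B, R):
        if dom(a1, b, r) and dom(a2, b, r):
            total += p1 * p2 * pb * pr
    return total


p1, p2 = pdom(A1, B, R), pdom(A2, B, R)
print(p1 * p2)                    # 0.25, the product that assumes independence
print(joint_pdom(A1, A2, B, R))   # 0.5, the probability over possible worlds
```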



B. Domination Approximations Based on Independent Objects 

In general, domination relations may have arbitrary correlations. Therefore, we present a way to compute
the domination count DomCount(B, R) while accounting for the dependencies between domination
relations.

Complete Domination: In an initial step, complete domination serves as a filter which allows us to
detect those objects $A \in \mathcal{D}$ that definitely dominate a specific object B w.r.t. R and those
objects that definitely do not dominate B w.r.t. R by means of evaluating PDom(A, B, R). It is important to
note that complete domination relations are mutually independent, since complete domination is evaluated
on the entire uncertainty regions of the objects. After applying complete domination, we have detected the
objects that dominate B in all possible worlds or in no possible world. Consequently, we get a first
approximation of the domination count DomCount(B, R): it cannot be lower than the number N of objects that
completely dominate B, and it cannot be higher than $|\mathcal{D}| - M$, where M is the number of objects
that dominate B in no possible world, i.e. $P(DomCount(B, R) = k) = 0$ for $k < N$ and
$k > |\mathcal{D}| - M$. Nevertheless, for $N \le k \le |\mathcal{D}| - M$ we still have only the trivial
approximation $0 \le P(DomCount(B, R) = k) \le 1$ of the domination count probability.

Probabilistic Domination: In order to refine this probability distribution, we have to consider the set of
influence objects influenceObjects = $\{A_1, \ldots, A_C\}$, which neither completely dominate B nor are
completely dominated by B w.r.t. R. For each $A_i \in$ influenceObjects, $0 < PDom(A_i, B, R) < 1$. For
these objects, we can compute the probabilities $PDom(A_1, B, R), \ldots, PDom(A_C, B, R)$ according to the
methodology in Section III. However, due to the mutual dependencies between domination relations (cf.
Section IV-A), we cannot simply use these probabilities directly, as they may produce incorrect results.
Instead, we use the observation that the objects $A_i$ are mutually independent and that each candidate
object $A_i$ only appears in a single domination relation
$Dom(A_1, B, R), \ldots, Dom(A_C, B, R)$. Exploiting this observation, we decompose only the objects
$A_1, \ldots, A_C$ to obtain mutually independent bounds for the probabilities
$PDom(A_1, B, R), \ldots, PDom(A_C, B, R)$, as stated by the following lemma:

Lemma 3. Let $A_1, \ldots, A_C$ be uncertain objects with disjoint object decompositions
$\mathcal{A}_1, \ldots, \mathcal{A}_C$, respectively. Also, let B and R be uncertain objects (without any
decomposition). The lower (upper) bound $PDom_{LB}(A_i, B, R)$ ($PDom_{UB}(A_i, B, R)$) as defined in
Lemma 1 (Lemma 2) of the random variable $Dom(A_i, B, R)$ is independent of the random variable
$Dom(A_j, B, R)$ ($1 \le i \neq j \le C$).

Proof: Consider the random variable $Dom(A_i, B, R)$ conditioned on the event $Dom(A_j, B, R) = 1$.
Using Equation 1, we can derive the lower bound probability of
$Dom(A_i, B, R) = 1 \mid Dom(A_j, B, R) = 1$ as follows:

$$PDom_{LB}(A_i, B, R \mid Dom(A_j, B, R) = 1) = \sum_{A_i' \in \mathcal{A}_i, B' \in \mathcal{B}, R' \in \mathcal{R}} \left[ P(a_i \in A_i' \mid Dom(A_j, B, R) = 1) \cdot P(b \in B' \mid Dom(A_j, B, R) = 1) \cdot P(r \in R' \mid Dom(A_j, B, R) = 1) \cdot \delta(A_i', B', R') \right]$$

Now we exploit that B and R are not decomposed, thus B' = B and R' = R, and thus
$P(b \in B' \mid Dom(A_j, B, R) = 1) = 1 = P(b \in B')$ and
$P(r \in R' \mid Dom(A_j, B, R) = 1) = 1 = P(r \in R')$. We obtain:

$$PDom_{LB}(A_i, B, R \mid Dom(A_j, B, R) = 1) = \sum_{A_i' \in \mathcal{A}_i, B' \in \mathcal{B}, R' \in \mathcal{R}} \left[ P(a_i \in A_i' \mid Dom(A_j, B, R) = 1) \cdot P(b \in B') \cdot P(r \in R') \cdot \delta(A_i', B', R') \right]$$

Next we exploit that $P(a_i \in A_i' \mid Dom(A_j, B, R) = 1) = P(a_i \in A_i')$, since $A_i$ is independent
of $Dom(A_j, B, R)$, and obtain:

$$PDom_{LB}(A_i, B, R \mid Dom(A_j, B, R) = 1) = \sum_{A_i' \in \mathcal{A}_i, B' \in \mathcal{B}, R' \in \mathcal{R}} \left[ P(a_i \in A_i') \cdot P(b \in B') \cdot P(r \in R') \cdot \delta(A_i', B', R') \right] = PDom_{LB}(A_i, B, R)$$

Analogously, it can be shown that

$$PDom_{UB}(A_i, B, R \mid Dom(A_j, B, R) = 1) = PDom_{UB}(A_i, B, R).$$

■

In summary, we can now derive, for each object $A_i$, a lower and an upper bound of the probability
that $A_i$ dominates B w.r.t. R. However, these bounds may still be rather loose, since so far we only
consider the full uncertainty regions of B and R, without any decomposition. In Section IV-E, we will show
how to obtain more accurate, still mutually independent probability bounds based on decompositions of B
and R. Due to the mutual independence of the lower and upper probability bounds, these probabilities
can now be used to get an approximation of the domination count of B. In order to do this efficiently,
we adapt the generating functions technique proposed in [19]. The main challenge here is to
extend the generating function technique in order to cope with probability bounds instead of concrete
probability values. It can be shown that a straightforward solution based on the existing generating
functions technique applied to the lower/upper probability bounds in an appropriate way does solve the
given problem efficiently, but overestimates the domination count probability and thus does not yield good
probability bounds. Rather, we have to redesign the generating functions technique such that lower/upper
probability bounds can be handled correctly.



C. Uncertain Generating Functions (UGFs) 

In this subsection, we will give a brief survey of the existing generating function technique (for more
details refer to [19]) and then propose our new technique of uncertain generating functions.

Generating Functions: Consider a set of N mutually independent, but not necessarily identically
distributed Bernoulli {0, 1} random variables $X_1, \ldots, X_N$. Let $P(X_i)$ denote the probability that
$X_i = 1$. The problem is to efficiently compute the distribution of the sum

$$\sum_{i=1}^{N} X_i = \sum_{i=1}^{N} Dom(A_i, B, R)$$

of these random variables. A naive solution would be to count, for each $0 \le k \le N$, all combinations
with exactly k occurrences of $X_i = 1$ and accumulate the respective probabilities of these combinations.
This approach, however, shows a complexity of $O(2^N)$. In [5], an approach was proposed that achieves an
$O(N)$ complexity using the Poisson Binomial Recurrence. Note that $O(N)$ time is asymptotically optimal
in general, since the computation involves at least $O(N)$ computations, namely $P(X_i)$,
$1 \le i \le N$. In the following, we propose a different approach that, albeit having the same linear
asymptotic complexity, has other advantages, as we will see. We apply the concept of generating functions
as proposed in the context of probabilistic ranking in [19]. Consider the function
$\mathcal{F}(x) = \prod_{i=1}^{n} (a_i + b_i x)$. The coefficient of $x^k$ in $\mathcal{F}(x)$ is given by
$\sum_{|\beta| = k} \prod_{i : \beta_i = 0} a_i \prod_{i : \beta_i = 1} b_i$, where
$\beta = (\beta_1, \ldots, \beta_n)$ is a Boolean vector and $|\beta|$ denotes the number of 1's in $\beta$.

Now consider the following generating function:

$$\mathcal{F}^N = \prod_{i=1}^{N} \left(1 - P(X_i) + P(X_i) \cdot x\right) = \sum_{j \ge 0} c_j x^j.$$

The coefficient $c_j$ of $x^j$ in the expansion of $\mathcal{F}^N$ is the probability that $X_i = 1$ holds for exactly $j$ of the
random variables $X_i$. Since $\mathcal{F}^i$ contains at most $i + 1$ non-zero terms and by observing that

$$\mathcal{F}^i = \mathcal{F}^{i-1} \cdot \left(1 - P(X_i) + P(X_i) \cdot x\right),$$

we note that $\mathcal{F}^i$ can be computed in $O(i)$ time given $\mathcal{F}^{i-1}$. Since $\mathcal{F}^0 = 1x^0 = 1$, we conclude that
$\mathcal{F}^N$ can be computed in $O(N^2)$ time. If only the first $k$ coefficients are required (i.e. coefficients $c_j$ where
$j < k$), this cost can be reduced to $O(k \cdot N)$ by simply dropping the summands $c_j x^j$ where $j \ge k$.
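The recurrence above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the function name `count_pdf` and the optional truncation parameter `k` are assumptions introduced here.

```python
from typing import List, Optional


def count_pdf(probs: List[float], k: Optional[int] = None) -> List[float]:
    """Coefficients c_j of prod_i (1 - p_i + p_i * x).

    c_j is the probability that exactly j of the independent Bernoulli
    variables equal 1. If k is given, only c_0 .. c_{k-1} are kept.
    """
    coeffs = [1.0]  # F^0 = 1 * x^0
    for p in probs:
        nxt = [0.0] * (len(coeffs) + 1)
        for j, c in enumerate(coeffs):
            nxt[j] += c * (1.0 - p)  # X_i = 0: the degree stays j
            nxt[j + 1] += c * p      # X_i = 1: the degree becomes j + 1
        # Dropping the summands with j >= k keeps each step at O(k) work.
        coeffs = nxt if k is None else nxt[:k]
    return coeffs


pdf = count_pdf([0.5, 0.5])         # full distribution of X_1 + X_2
first = count_pdf([0.5, 0.5], k=1)  # only P(sum = 0)
```

With two fair coins, `pdf` holds the probabilities of observing zero, one or two successes, and `first` holds only the probability of zero successes.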

Example 2. As an example, consider three independent random variables $X_1$, $X_2$ and $X_3$. Let $P(X_1) = 0.2$,
$P(X_2) = 0.1$ and $P(X_3) = 0.3$, and let $k = 2$. Then:

$$\mathcal{F}^1 = \mathcal{F}^0 \cdot (0.8 + 0.2x) = 0.2x^1 + 0.8x^0$$

$$\mathcal{F}^2 = \mathcal{F}^1 \cdot (0.9 + 0.1x) = 0.02x^2 + 0.26x^1 + 0.72x^0 \overset{*}{=} 0.26x^1 + 0.72x^0$$

$$\mathcal{F}^3 = \mathcal{F}^2 \cdot (0.7 + 0.3x) = 0.078x^2 + 0.398x^1 + 0.504x^0 \overset{*}{=} 0.398x^1 + 0.504x^0$$

Thus, $P(DomCount(B) = 0) = 50.4\%$ and $P(DomCount(B) = 1) = 39.8\%$. We obtain $P(DomCount(B) < 2) = 90.2\%$.
Thus, $B$ can be reported as a true hit if $\tau$ is not greater than 90.2%. Equations marked by
$*$ exploit that we only need to compute the $c_j$ where $j < k = 2$.

Uncertain Generating Functions: Consider a set of $N$ independent, but not necessarily identically
distributed, Bernoulli $\{0, 1\}$ random variables $X_i$, $1 \le i \le N$. Let $P_{LB}(X_i)$ ($P_{UB}(X_i)$) be a lower (upper)
bound approximation of the probability $P(X_i = 1)$. Consider the random variable

$$\sum_{i=1}^{N} X_i.$$

We make the following observation: The lower and upper bound probabilities $P_{LB}(X_i)$ and $P_{UB}(X_i)$
correspond to the probabilities of the three following events:

• $X_i = 1$ definitely holds with a probability of at least $P_{LB}(X_i) = PDom_{LB}(A_i, B, R)$.

• $X_i = 0$ definitely holds with a probability of at least $1 - P_{UB}(X_i)$.

• It is unknown whether $X_i = 0$ or $X_i = 1$ with the remaining probability of
$P_{UB}(X_i) - P_{LB}(X_i) = PDom_{UB}(A_i, B, R) - PDom_{LB}(A_i, B, R)$.

Based on this observation, we consider the following uncertain generating function (UGF):

$$\mathcal{F}^N = \prod_{i \in 1, \ldots, N} \left[ P_{LB}(X_i) \cdot x + (P_{UB}(X_i) - P_{LB}(X_i)) \cdot y + (1 - P_{UB}(X_i)) \right] = \sum_{i, j \ge 0} c_{i,j} x^i y^j.$$

The coefficient $c_{i,j}$ has the following meaning: With a probability of $c_{i,j}$, $B$ is definitely dominated
at least $i$ times, and possibly dominated another 0 to $j$ times. Therefore, the minimum probability that
$\sum_{i=1}^{N} X_i = k$ is $c_{k,0}$, since that is the probability that exactly $k$ random variables $X_i$ are 1. The maximum
probability that $\sum_{i=1}^{N} X_i = k$ is $\sum_{i \le k,\, i+j \ge k} c_{i,j}$, i.e. the total probability of all possible combinations
in which $\sum_{i=1}^{N} X_i = k$ may hold. Therefore, we obtain an approximated PDF of $\sum_{i=1}^{N} X_i$. In the
approximated PDF of $\sum_{i=1}^{N} X_i$, each probability $P(\sum_{i=1}^{N} X_i = k)$ is given by a conservative and a progressive
approximation.

Fig. 4. Approximated PDF of $\sum_{i=1}^{N} X_i$ (vertical axis: $P(\sum_{i=1}^{N} X_i = k)$).

Example 3. Let $P_{LB}(X_1) = 20\%$, $P_{UB}(X_1) = 50\%$, $P_{LB}(X_2) = 60\%$ and $P_{UB}(X_2) = 80\%$. The
generating function for the random variable $\sum_{i=1}^{2} X_i$ is the following:

$$\mathcal{F}^2 = (0.2x + 0.3y + 0.5)(0.6x + 0.2y + 0.2) = 0.12x^2 + 0.34x + 0.1 + 0.22xy + 0.16y + 0.06y^2$$

That implies that, with a probability of at least 12%, $\sum_{i=1}^{2} X_i = 2$. In addition, with a probability of 22%
plus 6%, it may hold that $\sum_{i=1}^{2} X_i = 2$, so that we obtain a probability bound of 12% to 40% for the
random event $\sum_{i=1}^{2} X_i = 2$. Analogously, $\sum_{i=1}^{2} X_i = 1$ with a probability of 34% to 78% and $\sum_{i=1}^{2} X_i = 0$
with a probability of 10% to 32%. The approximated PDF of $\sum_{i=1}^{2} X_i$ is depicted in Figure 4.

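Each expansion $\mathcal{F}^l$ can be obtained from the expansion of $\mathcal{F}^{l-1}$ as follows:

$$\mathcal{F}^l = \mathcal{F}^{l-1} \cdot \left[ P_{LB}(X_l) \cdot x + (1 - P_{UB}(X_l)) + (P_{UB}(X_l) - P_{LB}(X_l)) \cdot y \right].$$

We note that $\mathcal{F}^l$ contains at most $\sum_{i=1}^{l+1} i$ non-zero terms (one $c_{i,j}$ for each combination of $i$ and $j$
where $i + j \le l$). Therefore, the total complexity to compute $\mathcal{F}^l$ is $O(l^3)$.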



D. Efficient Domination Count Approximation using UGFs 

We can directly use the uncertain generating functions proposed in the previous section to derive bounds
for the probability distribution of the domination count $DomCount(B, R)$. Again, let $\mathcal{D} = \{A_1, \ldots, A_N\}$ be
an uncertain object database and $B$ and $R$ be uncertain objects in $\mathbb{R}^d$. Let $Dom(A_i, B, R)$, $1 \le i \le N$,
denote the random Bernoulli event that $A_i$ dominates $B$ w.r.t. $R$ (that is, $X[Dom(A_i, B, R)] = 1$ iff $A_i$
dominates $B$ w.r.t. $R$, and $X[Dom(A_i, B, R)] = 0$ otherwise). Also recall that the domination count
is defined as the random variable that is the sum of the domination indicator variables of all uncertain
objects in the database (cf. Definition 3).

Considering the generating function

$$\mathcal{F}^N = \prod_{i \in 1, \ldots, N} \left[ P_{LB}(Dom(A_i, B, R)) \cdot x + \left(P_{UB}(Dom(A_i, B, R)) - P_{LB}(Dom(A_i, B, R))\right) \cdot y + \left(1 - P_{UB}(Dom(A_i, B, R))\right) \right] = \sum_{i, j \ge 0} c_{i,j} x^i y^j, \quad (1)$$

we can efficiently compute lower and upper bounds of the probability that $DomCount(B, R) = k$ for
$0 \le k \le N$, as discussed in Section IV-C, because the independence property of random variables
required by the generating functions is satisfied due to Lemma 3.

Lemma 4. A lower bound $DomCount^k_{LB}(B, R)$ of the probability that $DomCount(B, R) = k$ is given
by

$$DomCount^k_{LB}(B, R) = c_{k,0}$$

and an upper bound $DomCount^k_{UB}(B, R)$ of the probability that $DomCount(B, R) = k$ is given by

$$DomCount^k_{UB}(B, R) = \sum_{i \le k,\, i+j \ge k} c_{i,j}.$$

Example 4. Assume a database containing uncertain objects $A_1$, $A_2$, $B$ and $R$. The task is to determine
a lower (upper) bound of the domination count probability $DomCount^k_{LB}(B, R)$ ($DomCount^k_{UB}(B, R)$)
of $B$ w.r.t. $R$. Assume that, by decomposing $A_1$ and $A_2$ and using the probabilistic domination
approximation technique proposed in Section III-B, we determine that $A_1$ has a minimum probability
$PDom_{LB}(A_1, B, R)$ of dominating $B$ of 20% and a maximum probability $PDom_{UB}(A_1, B, R)$ of 50%.
For $A_2$, $PDom_{LB}(A_2, B, R)$ is 60% and $PDom_{UB}(A_2, B, R)$ is 80%. By applying the technique in the
previous subsection, we get the same generating function as in Example 3 and thus the same approximated
PDF for $DomCount(B, R)$, depicted in Figure 4.
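The bounds of Lemma 4 can be checked numerically for the probabilities of Example 4. The following Python sketch expands the uncertain generating function factor by factor and reads off the lower and upper bounds; the names `expand_ugf` and `count_bounds` and the dictionary representation of the coefficients are illustrative assumptions, not part of the paper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Coeffs = Dict[Tuple[int, int], float]  # (i, j) -> c_ij, the coefficient of x^i y^j


def expand_ugf(bounds: List[Tuple[float, float]]) -> Coeffs:
    """Expand prod over (lb, ub) of [lb * x + (ub - lb) * y + (1 - ub)]."""
    coeffs: Coeffs = {(0, 0): 1.0}
    for lb, ub in bounds:
        nxt: Coeffs = defaultdict(float)
        for (i, j), c in coeffs.items():
            nxt[(i + 1, j)] += c * lb         # definitely dominated
            nxt[(i, j + 1)] += c * (ub - lb)  # unknown
            nxt[(i, j)] += c * (1.0 - ub)     # definitely not dominated
        coeffs = dict(nxt)
    return coeffs


def count_bounds(coeffs: Coeffs, n: int) -> Tuple[List[float], List[float]]:
    """Lower bound c_{k,0} and upper bound sum of c_ij with i <= k <= i + j."""
    lower = [coeffs.get((k, 0), 0.0) for k in range(n + 1)]
    upper = [
        sum(c for (i, j), c in coeffs.items() if i <= k <= i + j)
        for k in range(n + 1)
    ]
    return lower, upper


# Example 4: A_1 dominates B with probability in [0.2, 0.5], A_2 in [0.6, 0.8].
lower, upper = count_bounds(expand_ugf([(0.2, 0.5), (0.6, 0.8)]), 2)
```

For these inputs, `lower` and `upper` hold the bounds for domination counts 0, 1 and 2, i.e. 10% to 32%, 34% to 78% and 12% to 40%.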



To compute the uncertain generating function, and thus the probabilistic domination count of an object
in an uncertain database of size $N$, the total complexity is $O(N^3)$. The reason is that the maximal number
of coefficients of the generating function $\mathcal{F}^x$ is quadratic in $x$, since $\mathcal{F}^x$ contains coefficients $c_{i,j}$ where
$i + j \le x$, that is on the order of $\frac{x^2}{2}$ coefficients. Since we have to compute $\mathcal{F}^x$ for each $x \le N$, the total time
complexity is $O(N^3)$. Note that only candidate objects $c \in Cand$ for which a complete domination cannot
be detected (cf. Section III-A) have to be considered in the generating functions. Thus, the total runtime
to compute $DomCount^k_{LB}(B, R)$ as well as $DomCount^k_{UB}(B, R)$ is $O(|Cand|^3)$. In addition, we will
show in Section VI how to reduce, specifically for $k$NN and R$k$NN queries, the total time complexity to
$O(k^2 \cdot |Cand|)$.

Discussion: In the extended version of this paper ([3]), we show that instead of applying the uncertain
generating function to approximate the domination count of $B$, two regular generating functions can be
used: one generating function that uses the progressive (lower) bounds $P_{LB}(Dom(A_i, B, R))$ and one that
uses the conservative (upper) probability bounds $P_{UB}(Dom(A_i, B, R))$. However, we give an intuition
and a formal proof that using regular generating functions yields looser bounds for the approximated
domination count.



E. Efficient Domination Count Approximation Based on Disjunctive Worlds 

Since the uncertain objects $B$ and $R$ appear in each domination relation $PDom(A_1, B, R), \ldots,$
$PDom(A_C, B, R)$ that is to be evaluated, we cannot split objects $B$ and $R$ independently (cf. Section IV-A).
The reason for this dependency is that knowledge about the predicate $Dom(A_i, B, R)$ may impose
constraints on the position of $B$ and $R$. Thus, for a partition $B_1 \subseteq B$, the probability $PDom(A_j, B_1, R)$
may change given $Dom(A_i, B, R)$ ($1 \le i, j \le C$, $i \ne j$). However, note:

Lemma 5. Given fixed partitions $B' \subseteq B$ and $R' \subseteq R$, the random variables $Dom(A_i, B', R')$ are
mutually independent for $1 \le i \le C$.

Proof: Similar to the proof of Lemma 3. ∎

This allows us to individually consider the subset of possible worlds where $b \in B'$ and $r \in R'$ and use
Lemma 5 to efficiently compute the approximated domination count probabilities $DomCount^k_{LB}(B', R')$
and $DomCount^k_{UB}(B', R')$ under the condition that $B$ falls into a partition $B' \subseteq B$ and $R$ falls into
a partition $R' \subseteq R$. This can be performed for each pair $(B', R') \in \mathcal{B} \times \mathcal{R}$, where $\mathcal{B}$ and $\mathcal{R}$ denote
the decompositions of $B$ and $R$, respectively. Now, we can treat pairs of partitions $(B', R') \in \mathcal{B} \times \mathcal{R}$
independently, since all pairs of partitions represent disjunctive sets of possible worlds due to the
assumption of a disjunctive partitioning. Exploiting this independence, the PDF of the domination count
$DomCount(B, R)$ of the total objects $B$ and $R$ can then be obtained by creating an uncertain generating
function for each pair $(B', R')$ to derive a lower and an upper bound of $P(DomCount(B', R') = k)$ and
then computing the weighted sum of these bounds as follows:

$$DomCount^k_{LB}(B, R) = \sum_{B' \in \mathcal{B},\, R' \in \mathcal{R}} DomCount^k_{LB}(B', R') \cdot P(B') \cdot P(R').$$

The upper bound $DomCount^k_{UB}(B, R)$ is aggregated analogously.

The complete algorithm of our domination count approximation approach can be found in the next Section. 
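The weighted sum over pairs of partitions can be sketched as follows. The tuple layout `PairBounds` and the function name `aggregate` are assumptions made for illustration; the per-pair bounds are taken as given, e.g. from an uncertain generating function built for that pair.

```python
from typing import List, Sequence, Tuple

# (P(B'), P(R'), lower bounds per count k, upper bounds per count k)
PairBounds = Tuple[float, float, Sequence[float], Sequence[float]]


def aggregate(pairs: List[PairBounds], n: int) -> Tuple[List[float], List[float]]:
    """Weight the bounds of each pair (B', R') by P(B') * P(R') and sum them."""
    lower = [0.0] * (n + 1)
    upper = [0.0] * (n + 1)
    for p_b, p_r, lb, ub in pairs:
        weight = p_b * p_r  # probability of the disjoint set of worlds
        for k in range(n + 1):
            lower[k] += weight * lb[k]
            upper[k] += weight * ub[k]
    return lower, upper


# B is split into two halves of probability 0.5 each; R is left unsplit.
pairs = [
    (0.5, 1.0, [0.2, 0.5], [0.5, 0.8]),
    (0.5, 1.0, [0.4, 0.3], [0.7, 0.6]),
]
agg_lower, agg_upper = aggregate(pairs, 1)
```

Because the pairs of partitions cover disjoint sets of possible worlds, the weights sum to one and the aggregated values remain valid bounds.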

V. Implementation 

Algorithm 1 is a complete method for iteratively computing and refining the probabilistic domination
count for a given object $B$ and a reference object $R$. The algorithm starts by detecting complete domination
(cf. Section III-A). For each object that completely dominates $B$, a counter CompleteDominationCount
is increased, and each object that is completely dominated by $B$ is removed from further consideration,
since it has no influence on the domination count of $B$. The remaining objects, which may have a
probability greater than zero and less than one to dominate $B$, are stored in a set influenceObjects.
The set influenceObjects is now used to compute the probabilistic domination count (DomCountLB,
DomCountUB). The main loop of the probabilistic domination count approximation (the while loop) starts in line 14. In
each iteration, $B$, $R$, and all influence objects are partitioned. For each combination of partitions $B'$ and $R'$,
and each database object $A_i \in$ influenceObjects, the probability $PDom(A_i, B', R')$ is approximated (cf.
Section IV-B). These domination probability bounds are used to build an uncertain generating function (cf.
Section IV-D) for the domination count of $B'$ w.r.t. $R'$. Finally, these domination counts are aggregated for
each pair of partitions $B'$, $R'$ into the domination count $DomCount(B, R)$ (cf. Section IV-E). The main
loop continues until a domain- and user-specific stop criterion is satisfied. For example, for a threshold
$k$NN query, the stop criterion is satisfied as soon as the lower bound of the probability that $B$ has a
domination count of less than $k$ exceeds the given threshold, or the upper bound of this probability falls below it.

The progressive decomposition of objects (line 15) can be facilitated by precomputed split points at the 
object PDFs. More specifically, we can iteratively split each object X by means of a median-split-based
bisection method and use a kd-tree [0 to hierarchically organize the resulting partitions. The kd-tree is 
a binary tree. The root of a kd-tree represents the complete region of an uncertain object. Every node 
implicitly generates a splitting hyperplane that divides the space into two subspaces. This hyperplane is 
perpendicular to a chosen split axis and located at the median of the node's distribution in this axis. The 
advantage is that, for each node in the kd-tree, the probability of the respective subregion $X'$ is simply
given by $0.5^{X'.level - 1}$, where $X'.level$ is the level of $X'$. In addition, the bounds of a subregion $X'$ can
be determined by backtracking to the root. In general, for continuously partitioned uncertain objects, the
corresponding kd-tree may have an infinite height; for practical reasons, however, the height $h$ of the
kd-tree is limited. The choice of $h$ is a trade-off between approximation quality and efficiency: for a very
large $h$, considering each leaf node is similar to applying integration on the PDFs, which yields an exact
result; however, the number of leaf nodes, and thus the worst-case complexity, increases exponentially in
$h$. Note that our experiments (cf. Section VII) show that a low $h$ value is sufficient to yield reasonably
tight approximation bounds. Yet it has to be noted that, in the general case of continuous uncertainty,
our proposed approach may only return an approximation of the exact probabilistic domination count.
However, such an approximation may be sufficient to decide a given predicate, as we will see in Section
VI, and even in the case where the approximation does not suffice to decide the query predicate, the
approximation will give the user a confidence value, based on which a user may be able to decide whether
to include an object in the result.



DomCountLB and DomCountUB are lists containing, at each position $i$, a lower and an upper bound for $P(DomCount(B, R) = i)$,
respectively. This notation is equivalent to a single uncertain domination count PDF.
Algorithm 1 Probabilistic Inverse Ranking

Require: Q, B, D
  influenceObjects = ∅
  CompleteDominationCount = 0
  // Complete Domination
  for all A_i ∈ D do
    if DDC_Optimal(A_i, B, R) then
      CompleteDominationCount++
    else if ¬DDC_Optimal(B, A_i, R) then
      influenceObjects = influenceObjects ∪ {A_i}
    end if
  end for
  // probabilistic domination count
  DomCountLB = [0, ..., 0] // length |D|
  DomCountUB = [1, ..., 1] // length |D|
  while ¬stopcriterion do
    split(R), split(B), split(A_i ∈ D)
    for all B' ∈ 𝓑, R' ∈ 𝓡 do
      candLB = [0, ..., 0] // length |uncertainObjects|
      candUB = [1, ..., 1] // length |uncertainObjects|
      for all (0 ≤ i < |influenceObjects|) do
        A_i = influenceObjects[i]
        for all A'_i ∈ A_i do
          if DDC_Optimal(A'_i, B', R') then
            candLB[i] += P(A'_i)
          else if DDC_Optimal(B', A'_i, R') then
            candUB[i] -= P(A'_i)
          end if
        end for
      end for
      compute DomCountLB(B', R') and DomCountUB(B', R') using UGFs
      for all (0 ≤ i < |D|) do
        DomCountLB[i] += DomCountLB^i(B', R') · P(B') · P(R')
        DomCountUB[i] += DomCountUB^i(B', R') · P(B') · P(R')
      end for
    end for
    ShiftRight(DomCountLB, CompleteDominationCount)
    ShiftRight(DomCountUB, CompleteDominationCount)
  end while
  return (DomCountLB, DomCountUB)
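The first phase of the algorithm (complete domination filtering) and the final shift can be sketched in Python. This sketch makes two loud assumptions: uncertain objects are represented by axis-parallel rectangles, and the simpler min/max distance test stands in for the optimal criterion of Section III-A, which is not reproduced here. All function names are illustrative.

```python
from typing import List, Sequence, Tuple

Rect = Tuple[Tuple[float, float], ...]  # one (low, high) interval per dimension


def min_dist2(a: Rect, b: Rect) -> float:
    """Squared minimum Euclidean distance between two rectangles."""
    total = 0.0
    for (al, ah), (bl, bh) in zip(a, b):
        gap = max(bl - ah, al - bh, 0.0)
        total += gap * gap
    return total


def max_dist2(a: Rect, b: Rect) -> float:
    """Squared maximum Euclidean distance between two rectangles."""
    total = 0.0
    for (al, ah), (bl, bh) in zip(a, b):
        far = max(ah - bl, bh - al)
        total += far * far
    return total


def dominates(a: Rect, b: Rect, r: Rect) -> bool:
    """Min/max test (assumed stand-in): a is closer to r than b in every world."""
    return max_dist2(a, r) < min_dist2(b, r)


def complete_domination_filter(db: Sequence[Rect], b: Rect, r: Rect) -> Tuple[int, List[Rect]]:
    complete = 0
    influence: List[Rect] = []
    for a in db:
        if dominates(a, b, r):
            complete += 1        # a always dominates b
        elif not dominates(b, a, r):
            influence.append(a)  # undecided: needs probabilistic refinement
    return complete, influence


def shift_right(pdf: Sequence[float], offset: int) -> List[float]:
    """Move P(count = i) to position i + offset, keeping the list length."""
    return ([0.0] * offset + list(pdf))[: len(pdf)]
```

Since the min/max test is only a sufficient condition, this sketch may keep more influence objects than a tighter criterion would; the counts it reports remain correct.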



VI. Applications 

In this section, we outline how the probabilistic domination count can be used to efficiently evaluate
a variety of probabilistic similarity query types, namely the probabilistic inverse similarity ranking query
[21], the probabilistic threshold $k$NN query [10], the probabilistic threshold reverse $k$NN query and the
probabilistic similarity ranking query [4], [14], [19], [25]. We start with the probabilistic inverse ranking
query, because it can be derived trivially from the probabilistic domination count introduced in Section IV.
In the following, let $\mathcal{D} = \{A_1, \ldots, A_N\}$ be an uncertain database containing uncertain objects $A_1, \ldots, A_N$.

Corollary 3. Let $B$ and $R$ be uncertain objects. The task is to determine the probabilistic ranking
distribution $Rank(B, R)$ of $B$ w.r.t. similarity to $R$, i.e. the distribution of the position $Rank(B, R)$ of
object $B$ in a complete similarity ranking of $A_1, \ldots, A_N, B$ w.r.t. the distance to an uncertain reference
object $R$. Using our techniques, we can compute $Rank(B, R)$ as follows:

$$P(Rank(B, R) = i) = P(DomCount(B, R) = i - 1)$$

The above corollary is evident, since the proposition "$B$ has rank $i$" is equivalent to the proposition
"$B$ is dominated by $i - 1$ objects".

The most prominent probabilistic similarity search query is the probabilistic threshold $k$NN query.

Corollary 4. Let $Q = R$ be an uncertain query object and let $k$ be a scalar. The problem is to find all
uncertain objects $kNN_\tau(Q)$ that are the $k$-nearest neighbors of $Q$ with a probability of at least $\tau$. Using
our techniques, we can compute the probability $P^{kNN}(B, Q)$ that an object $B$ is a $k$NN of $Q$ as follows:

$$P^{kNN}(B, Q) = \sum_{i=0}^{k-1} P(DomCount(B, Q) = i)$$

The above corollary is evident, since the proposition "$B$ is a $k$NN of $Q$" is equivalent to the proposition
"$B$ is dominated by less than $k$ objects". To decide whether $B$ is a $k$NN of $Q$, i.e. if $B \in kNN_\tau(Q)$, we
just need to check if $P^{kNN}(B, Q) \ge \tau$.
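When only lower and upper bounds of the domination count PDF are available, the threshold test can end in one of three outcomes. The sketch below is one possible way to derive them; the function name `decide_knn` and the returned labels are assumptions for illustration, and the upper bound combines two simple valid estimates rather than following a procedure stated in the paper.

```python
from typing import Sequence


def decide_knn(lower: Sequence[float], upper: Sequence[float], k: int, tau: float) -> str:
    """Decide whether P(DomCount < k) >= tau, given bounds for each P(DomCount = i)."""
    # The sum of the per-count lower bounds is a lower bound of P(DomCount < k).
    p_min = sum(lower[:k])
    # Two valid upper bounds: the sum of the per-count upper bounds, and one minus
    # the guaranteed probability mass at counts >= k. Take the tighter one.
    p_max = min(1.0, sum(upper[:k]), 1.0 - sum(lower[k:]))
    if p_min >= tau:
        return "hit"        # B is certainly a result
    if p_max < tau:
        return "drop"       # B is certainly not a result
    return "undecided"      # further refinement is required
```

For example, with lower bounds `[0.10, 0.34, 0.12]`, upper bounds `[0.32, 0.78, 0.40]` and `k = 1`, a threshold of 0.05 yields a hit, 0.5 yields a drop, and 0.2 stays undecided.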

Next we show how to answer probabilistic threshold R$k$NN queries.

Corollary 5. Let $Q = R$ be an uncertain query object and let $k$ be a scalar. The problem is to find
all uncertain objects $A_i$ that have $Q$ as one of their $k$NNs with a probability of at least $\tau$, that is, all
objects $A_i$ for which it holds that $Q \in kNN_\tau(A_i)$. Using our techniques, we can compute the probability
$P^{RkNN}(B, Q)$ that an object $B$ is a R$k$NN of $Q$ as follows:

$$P^{RkNN}(B, Q) = \sum_{i=0}^{k-1} P(DomCount(Q, B) = i)$$

The intuition here is that an object $B$ is a R$k$NN of $Q$ if and only if $Q$ is dominated less than $k$ times
w.r.t. $B$.

For $k$NN and R$k$NN queries, the total complexity to compute the uncertain generating function can
be improved from $O(|Cand|^3)$ to $O(|Cand| \cdot k^2)$, since it can be observed from Corollaries 4 and 5
that for $k$NN and R$k$NN queries, we only require the section of the PDF of $DomCount(B, R)$ where
$DomCount(B, R) < k$, i.e. we only need to know the probabilities $P(DomCount(B, R) = x)$, $x < k$.
This can be exploited to improve the runtime of the computation of the PDF of $DomCount(B, R)$ as
follows: Consider the iterative computation of the generating functions $\mathcal{F}^1, \ldots, \mathcal{F}^{|Cand|}$. For each $\mathcal{F}^l$,
$1 \le l \le |Cand|$, we only need to consider the coefficients $c_{i,j}$ in the generating function $\mathcal{F}^l$ where $i < k$, since
only these coefficients have an influence on $P(DomCount(B, R) = x)$, $x < k$ (cf. Section IV). In addition,
we can merge all coefficients $c_{i,j}$, $c_{i',j'}$ where $i = i'$, $i + j \ge k$ and $i' + j' \ge k$, since all these coefficients
only differ in their influence on the upper bounds of $P(DomCount(B, R) = x)$, $x \ge k$, and are treated
equally for $P(DomCount(B, R) = x)$, $x < k$. Thus, each $\mathcal{F}^l$ contains at most $\sum_{i=1}^{k+1} i$ coefficients (one
$c_{i,j}$ for each combination of $i$ and $j$ where $i + j \le k$), which reduces the total complexity to $O(k^2 \cdot |Cand|)$.
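The pruning and merging of coefficients can be sketched as follows. This is a hedged illustration under the assumption that coefficients are stored in a dictionary keyed by the exponents of $x$ and $y$; the names `expand_ugf` and `bounds_below` are not from the paper. Terms with $i \ge k$ are dropped, and the exponent of $y$ is capped at $k - i$ so that all terms with $i + j \ge k$ share one entry.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

Coeffs = Dict[Tuple[int, int], float]


def expand_ugf(bounds: Sequence[Tuple[float, float]], k: Optional[int] = None) -> Coeffs:
    """Expand the UGF; if k is given, keep only what matters for counts x < k."""
    coeffs: Coeffs = {(0, 0): 1.0}
    for lb, ub in bounds:
        nxt: Coeffs = defaultdict(float)
        for (i, j), c in coeffs.items():
            for di, dj, w in ((1, 0, lb), (0, 1, ub - lb), (0, 0, 1.0 - ub)):
                ni, nj = i + di, j + dj
                if k is not None:
                    if ni >= k:
                        continue          # cannot influence P(count = x) for x < k
                    nj = min(nj, k - ni)  # merge all terms with i + j >= k
                nxt[(ni, nj)] += c * w
        coeffs = dict(nxt)
    return coeffs


def bounds_below(coeffs: Coeffs, k: int) -> Tuple[List[float], List[float]]:
    """Lower and upper bounds of P(count = x) for x = 0 .. k - 1."""
    lower = [coeffs.get((x, 0), 0.0) for x in range(k)]
    upper = [
        sum(c for (i, j), c in coeffs.items() if i <= x <= i + j)
        for x in range(k)
    ]
    return lower, upper
```

Calling `bounds_below(expand_ugf(b, k), k)` is intended to return the same bounds as `bounds_below(expand_ugf(b), k)` while storing only a number of coefficients that depends on `k`, not on the number of candidates.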

Finally, we show how to compute the expected rank (cf. [14]) of an uncertain object.

Corollary 6. Let $Q = R$ be an uncertain query object. The problem is to rank the uncertain objects $A_i$
according to their expected rank $E(Rank(A_i))$ w.r.t. the distance to $Q$. The expected rank of an uncertain
object $A_i$ can be computed as follows:

$$E(Rank(A_i)) = \sum_{j=0}^{N-1} P(DomCount(A_i, Q) = j) \cdot (j + 1)$$
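As a small illustration of the relation between domination count and rank, the following sketch derives the rank distribution and the expected rank from an exact domination count PDF. The function names are assumptions; with only lower and upper bounds of the PDF available, this computation would have to be adapted.

```python
from typing import List, Sequence


def rank_pdf(dom_count_pdf: Sequence[float]) -> List[float]:
    """P(Rank = i) stored at index i; rank 0 does not exist, so index 0 is 0.0."""
    return [0.0] + list(dom_count_pdf)


def expected_rank(dom_count_pdf: Sequence[float]) -> float:
    """Sum over counts i of P(DomCount = i) * (i + 1)."""
    return sum(p * (i + 1) for i, p in enumerate(dom_count_pdf))
```

For a domination count PDF of `[0.25, 0.5, 0.25]`, the object has rank 1, 2 or 3 with these probabilities, and the expected rank is 2.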

Other probabilistic similarity queries (e.g. $k$NN and R$k$NN queries with a different uncertainty predicate
instead of a threshold $\tau$) can be approximated efficiently using our techniques as well. Details are omitted
due to space constraints.

VII. Experimental Evaluation

In this section, we review the characteristics of the proposed algorithm on synthetic and real-world data.
The algorithm will be referred to as IDCA (Iterative Domination Count Approximation). We performed
experiments under various parameter settings. Unless otherwise stated, for 100 queries, we chose $B$ to
be the object with the 10th smallest MinDist to the reference object $R$. We used a synthetic dataset with
10,000 objects modeled as 2D rectangles. The degree of uncertainty of the objects in each dimension
is modeled by their relative extent. The extents were generated uniformly at random with 0.004
as maximum value. For the evaluation on real-world data, we utilized the International Ice Patrol (IIP)
Iceberg Sightings Dataset. This dataset contains information about iceberg activity in the North Atlantic
in 2009. The latitude and longitude values of sighted icebergs serve as certain 2D mean values for the
6,216 probabilistic objects that we generated. Based on the date and the time of the latest sighting, we
added Gaussian noise to each object, such that the time elapsed since the latest sighting
corresponds to the degree of uncertainty (i.e. the extent). The extents were normalized w.r.t. the extent of
the data space, and the maximum extent of an object in either dimension is 0.0004.

Fig. 5. Runtime of MC for increasing sample size.



A. Runtime of the Monte-Carlo-based Approach 

To the best of our knowledge, there exists no approach which is able to process uncertain similarity
queries on probabilistic databases with continuous PDFs. A naive approach needs to consider all possible
worlds and thus needs to integrate over all object PDFs, implying a runtime exponential in the number
of objects. Since this is not applicable even for small databases, we adapted an existing approach to cope
with the conditions. The approach most related to our work is [21], which solves the problem of computing
the domination count for a certain query and discrete distributions within the database objects. Thus the
proposed comparison partner works as follows: Draw a sufficiently large number $S$ of samples from
each object by Monte-Carlo sampling. Then, for each sample $q_i \in Q$ of the query, apply the algorithm
proposed in [21] to compute an exact probabilistic domination count PDF of an object $B$. As proposed in
[21], this is done using the generating function technique and using an and/xor tree to combine individual
samples into discrete distributed uncertain objects. Finally, accumulate the resulting certain domination
count PDFs of each $q_i \in Q$ into a single domination count PDF by taking the average. The execution
time for this approach, which we will refer to as MC in the following, is shown in Figure 5. It can be
observed that for a reasonable sample size (which is required to achieve a result that is close to the correct
result with high probability) the runtime becomes very large.
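To illustrate the general idea of sampling-based estimation, the sketch below uses plain possible-world sampling. It is deliberately simpler than the MC comparison partner described above (it does not use the algorithm of [21] or an and/xor tree), and it assumes that every uncertain object is uniformly distributed over a 2D rectangle. All names are illustrative.

```python
import random
from typing import List, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Point = Tuple[float, float]


def sample(rect: Rect, rng: random.Random) -> Point:
    """Draw one position uniformly from the rectangle (assumed uniform PDF)."""
    x0, y0, x1, y1 = rect
    return rng.uniform(x0, x1), rng.uniform(y0, y1)


def dist2(p: Point, q: Point) -> float:
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def mc_dom_count_pdf(db: Sequence[Rect], b: Rect, r: Rect, worlds: int, seed: int = 0) -> List[float]:
    """Estimate P(DomCount(B, R) = i) by sampling `worlds` possible worlds."""
    rng = random.Random(seed)
    hist = [0] * (len(db) + 1)
    for _ in range(worlds):
        rp, bp = sample(r, rng), sample(b, rng)
        d_b = dist2(bp, rp)
        # Count the database objects that are closer to R than B in this world.
        count = sum(1 for a in db if dist2(sample(a, rng), rp) < d_b)
        hist[count] += 1
    return [h / worlds for h in hist]
```

The estimate only converges to the true PDF as the number of sampled worlds grows, which is consistent with the large runtimes reported for reasonable sample sizes.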

Note that our comparison partner only works for discrete uncertain data (cf. Section VII-A). To make
a fair comparison, our approach relies on the same uncertainty model (default: 1000 samples/object).
Nevertheless, all the experiments yield analogous results for continuous distributions.



B. Optimal vs. Min/Max Decision Criterion 

In the first experiment, we evaluate the gain of pruning power using the complete similarity domination
technique (cf. Section III-A) instead of the state-of-the-art min/max decision criterion to prune uncertain
objects from the search space. The first experiment evaluates the number of uncertain objects that cannot be
pruned using complete domination only, that is, the number of candidates that have to be evaluated in our algorithm.
Figure 6(a) shows that our domination criterion (in the following denoted as optimal) is able to prune
about 20% more candidates than the min/max pruning criterion. In addition, we evaluated the domination
count approximation quality (in the remainder denoted as uncertainty) after each decomposition iteration
of the algorithm, which is defined as the sum $\sum_{i=0}^{N} \left( DomCount^i_{UB}(B, R) - DomCount^i_{LB}(B, R) \right)$. The
result is shown in Figure 6(b). The improvement of the complete domination (denoted as iteration 0) can
also be observed in further iterations. After enough iterations, the uncertainty converges to zero for both
approaches.

Fig. 6. Optimal vs. MinMax decision criterion.

Fig. 7. Uncertainty of IDCA w.r.t. the relative runtime to MC. (a) Synthetic Data. (b) Real Data.

The IIP dataset is available at the National Snow and Ice Data Center (NSIDC) web site (http://nsidc.org/data/g00807.html).



C. Iterative Domination Count Approximation 

Next, we evaluate the trade-off of our approach regarding approximation quality and the invested runtime
of our domination count approximation. The results can be seen in Figure 7 for different sample sizes and
datasets. It can be seen that initially, i.e. in the first iterations, the average uncertainty of an
influenceObject decreases rapidly. The less uncertainty is left, the more computational
power is required to reduce it any further. Except for the last iteration (resulting in 0 uncertainty), each
of the previous iterations is considerably faster than MC. In some cases (see Figure 7(b)), IDCA is even
faster in computing the exact result.

Fig. 8. Runtimes of IDCA and MC for different query predicates $k$ and $\tau$.

Fig. 9. Impact of influencing objects. (a) Runtime w.r.t. number of influence objects.



D. Queries with a Predicate 

When integrated in an application, one often wants to decide whether an object satisfies a predicate with a
certain probability. In the next experiment, we posed queries in the form: Is object $B$ among the $k$ nearest
neighbors of $Q$ (predicate) with a probability of 25%, 50%, 75%? The results are shown in Figure 8 for
various $k$-values. With a given predicate, IDCA is able to terminate the iterative refinement of the
objects earlier in most of the cases, which results in a runtime that is orders of magnitude below MC.
On average, the runtime is below MC in all settings.



E. Number of influenceObjects 

The runtime of the algorithm is mainly dependent on the number of objects which are responsible for
the uncertainty of the rank of $B$. The number of influenceObjects depends on the number of objects in
the database, the extent of the objects and the distance between $Q$ and $B$. The larger this distance, the
higher the number of influenceObjects. For the experiments in Figure 9(a), we varied the distance between
$Q$ and $B$ and measured the runtime for each iteration. In Figure 9(b), we present runtimes for different
sizes of the database. The maximum extent of the objects was set to 0.002 and the number of objects in
the database was scaled from 20,000 to 100,000. Both experiments show that IDCA scales well with the
number of influencing objects.



VIII. Conclusions 

In this paper, we applied the concept of probabilistic similarity domination on uncertain data. We
introduced a geometric pruning filter to conservatively and progressively approximate the probability
that an object is being dominated by another object. An iterative filter-refinement strategy is used to
stepwise improve this approximation in an efficient way. Specifically, we propose a method to efficiently
and effectively approximate the domination count of an object using a novel technique of uncertain
generating functions. We show that the proposed concepts can be used to efficiently answer a wide range
of probabilistic similarity queries while keeping correctness according to the possible world semantics.
Our experiments show that our iterative filter-refinement strategy is able to achieve a high level of precision
at a low runtime. As future work, we plan to investigate further heuristics for the refinement process in
each iteration of the algorithm. Furthermore, we will integrate our concepts into existing index-supported
$k$NN and R$k$NN query algorithms.